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Abstract: In multivariate nonparamotric analysis, sparseness of the co- 
variates also called curse of dimensionality, forces one to use large smoothing 
parameters. This leads to a biased smoother. Instead of focusing on opti- 
mally selecting the smoothing parameter, we fix it to some reasonably large 
value to ensure an over-smoothing of the data. The resulting base smoother 
has a small variance but a substantial bias. In this paper, we propose an 
R package named ibr to iteratively correct the initial bias of the (base) 
estimator by an estimate of the bias obtained by smoothing the residuals. 
After a brief description of Iterated Bias Reduction smoothers, we examine 
the base smoothers implemented in the packages: Nadaraya- Watson kernel 
smoothers and thin plate splines smoothers. Then, we explain the stopping 
rules available in the package and their implementation. Finally we illus- 
trate the package on two examples: a toy example in and the original 
Los Angeles ozone dataset. 

Keywords and phrases: multivariate smoothing, L2 boosting, thin-plate 
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1. Introduction 



Regression is a fundamental data analysis tool for uncovering functional rela- 
tionships for the conditional expectation from pairs of observations {Xi,Yi), i = 
1, . . . ,n. Classical linear regression is the simplest example of this. More gen- 
erally, we can let the data help determine the general form of the relationship 
by using one of the numerous non-parametric regression estimators, such as 
wavelets based smoothers, kernel smoothers, and splines smoothers [Buja ct al.. 
1989, Cleveland and Devlin, 1988, Eubank. 1988, Fan and Gijbels, 1996, Antoniadis and Oppenheim, 
1995, SimonofF, 1996]. The latter flexible smoothing methods are implemented 
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as R functions found in numerous contributed packages. For instance the pack- 
age wavethresh [Nason, 2010] implements a wavelet based smootlier, the pack- 
age lokern [for R and enhanced by Martin Maechler, 2010] provides a kernel 
smoothers and the function smooth, spline calculates a cubic spHne smoother. 
When the number of dependent variables d is greater than 3 or 4, fully non- 
parametric regression suffers from the curse of dimensionality, even for moderate 
sample sizes (say n being equal to a few hundred). As a result, application of fully 
non-parametric methods are discouraged in dimensions four and higher. Instead, 
the statistical literature encourages using constrained non-paramctric regression 
models (additive models [Hastie and Tibshirani, 1990], single and multiple index 
models and projection pursuit models) to estimate useful approximations of the 
conditional expectation. The latter methods are provided to the R community 
in the contributed package mgcv [Wood, 2011] for additive modelling, function 
ppr for projection pursuit and package mda [Hastie et al., 2011] for MARS. 

Originating from the machine learning community, the boosting algorithm is 
also another tool for non-parametric regression [see Friedman, 2001, and refer- 
ences therein]. This fairly recent and very popular method has numerous vari- 
ations, such as adaboost (the original method), logitboost for classification, 
and the L2 boosting for regression. The interesting feature is that it provides 
a framework for combining various weak learners (non-parametric smoothers) 
into a smoother that is better than any single smoother that it is composed 
off. Packages for L2 boosting are already available in R: for instance the pack- 
age mboost [Hothorn et al., 2010] allows for L2 boosting for regression problem 
as well as logistic boosting for classification. For multivariate regression, the 
L2 boosting algorithm has been applied to component-wise additive modelling 
with classical smoother such as smoothing splines [sec Biihlmann and Hothorn, 
2007]. 

Linking the L2 boosting algorithm to an iterative bias correction scheme (see 
for example Biihlmann and Hothorn [2007] for boosting of smoothing splines 
and Cornillon et al. [2011] for discussions on L2 boosting of more general smoothers) 
provides a statistical interpretation of the L2 boosting algorithm. This interpre- 
tation was alluded to in Ridgeway [2000] 's discussion of Friedman et al. [2000] 
paper on the statistical interpretation of boosting. The basic idea of estimat- 
ing (and correcting for) the bias of a pilot smoother goes back to the concept 
of twicing introduced by Tukey [1977]. The idea of iterating the bias correc- 
tion was central to adaptive bagging algorithm of Breiman [1999]. More details 
about statistical properties in univariate or multivariate smoother can be found 
in Buhlmann and Yu [2003] or Cornillon et al. [2011]. 

This paper focuses on the computational implementation in R [R Development Core Team, 
2009] of the iterated bias correction procedure for fully multivariate regression 
smoothers. We start in Section two by briefly presenting the concept of iterative 
bias reduction and recalling its connection to L2 boosting. The details of our 
numerical implementation and a review of available options in our R package 
ibr are given in Section three. The last section is devoted to examples. 
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2. Iterative bias reduction smoothers 
2.1. Method 

Suppose that the pairs (X^, F^) G M'^ x R are related through the non-parametric 
regression model 

Yi = m{X^)+ei, i = l,...,n, (1) 

where m(-) is an unknown smooth function, and the disturbances are inde- 
pendent mean zero and variance random variables that are independent of 
all the covariates. It is helpful to rewrite Equation (1) in vector form by setting 
Y = (Yi, . . . ,y„)*, m = {m{Xi), . . . ,m{Xn)f and e = (ei, . . .,£„)*, to get 

Y = m + s. (2) 

Linear smoothers estimate the regression function m evaluated at the covariates 
by linear combinations of the responses that can be compactly written as 

mi = SxY, (3) 

where is an n x n smoothing matrix with smoothing parameter A. By 
slight abuse of language, we will sometimes refer to the vector of fitted val- 
ues fh = Y = (Yi, . . . , YnY as the smooth of Y. Typical smoothers [see for in- 
stance Hastie et al., 2001] include bin smoothers, spline based smoothers (regres- 
sion splines, smoothing splines, and thin-plate splines), kernel based smoothers 
(Nadaraya- Watson kernels and local polynomials smoothers), and series based 
smoothers (Fourier smoothers and wavelet smoothers). In this paper, we focus 
only on two common types of smoothers: Nadaraya- Watson kernels (where A 
is the bandwidth) and thin-plate splines (where A is the penalty parameter). 
Extensions to other smoothers can easily be achieved by suitably modifying our 
theoretical results and software. 
The linear smoother (3) has bias 

B{mi) = E[mi\X] - m = {Sx - I)m 

and variance 

V{fh^\X) = {SxSi) a\ 

To estimate the bias, observe that the residuals Ri =Y — fhi = (/ — S\)Y have 
expected value E[Ri\X] = m — E[fhi\X] = {I — S\)m — —B{fhi). This suggests 
estimating the bias by smoothing the negative residuals 

Si := -SxRi = -Sx{I-Sx)Y. 

Recall that, in multivariate non-parametric analysis, sparseness of the covariates 
also called curse of dimensionality, forces one to use large smoothing parameters 
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A. This leads to very biased base smoother Sx- Thus the bias correction in 
multivariate non-parametric analysis arises as a natural tool to correct classical 
smoother iSa- If A is large, not all the bias is usually removed after a the first 
correction. To remove the remaining bias, iterations of the bias reduction step 
have to be performed. For instance k—l bias reduction step produces the linear 
smoother at iteration k: 



When d ~ 1, the sequence of iterated bias corrected smoothers agrees with the 
L2-boosted smoothers without shrinkage. For d > 1, the boosting algorithm is 
applied component-wise to additive regression models [see Biihlmann and Yu, 
2003] . This results in a sequence of constrained (additive) approximation of the 
fully non-parametric regression function m. 

For thin-plate splines and kernels smoothers (with suitable kernels, such as 
as a Gaussian density function), each iteration of the bias correction produces 
a noisier but less biased smoother. In the limit, the sequence of iterative bias 
corrected smoothers reproduces the raw data [Cornillon et al., 2011]. Thus there 
is a need for good stopping rules for the iterative bias correction algorithm. 

To illustrate this behavior, let us use a classical bivariate regression problem: 
figure 1 graphs Wendelberger's test function [Wendelberger, 1982]: 



The sequence of bias corrected thin-plate spline smoothers, starting from a 
pilot that over-smooths the data, converges to an interpolant of the raw data 
(see figure 2 (c)). After some suitable number of bias correction steps, the re- 
sulting bias corrected smoother will be a good estimate for the true underlying 
regression function (see figure 2 (b)). The crucial choice of k is achieved by the 
use of classical criterion such as corrected AIC or GOV (see section 2.2). 

We note that, provided A is large enough, its exact value is not crucial as the 
choice of k will adapt to A: if two base smoothers are chosen, one with Ai and 
another with A2 > Ai, the chosen iteration k2 will be greater than fci as it takes 
more iterations with a very smooth base smoother to remove the bias. 

To make prediction at arbitrary locations x GMf^ of the covariates, we extend 
linear smoothers to functions of the form 



SxY + Sx{I - Sx)Y + • • ■ + Sx[I - Sxf~^Y 
{I -{I- Sx)'')Y. 



(4) 




+ ^ exp {-((9.T + 1)2/49 + (9y + 1)^10)} + 
-Kiexp{-((9.T-7)2 + (9y-3)2)/4)}- 
-iexp{-((9x-4)2 + (9y-7)2)}. 



(5) 



7fl(x) 



SxixfY, 



(6) 
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(a) (b) (c) 




Fig 2: Thin-plate spline regression smoothers from 100 noisy observations from 
5 (see Figure 1) computed on a regular grid on [0, 1] x [0, 1]. Panel (a) shows the 
pilot smoother, panel (b) graphs the bias corrected smoother after 500 iterations 
and panel (c) graphs the smoother after 50000 iterations of the bias correction 
scheme. 



where S{x) is a vector of size n whose entries are the weights for predicting 
m{x). The vector S{x) reduces to the j^^ row of the smoothing matrix when 
X = Xj, and is readily computed for many of the smoothers used in practice. 

For extended base smoothers of the form (6), we propose to extend the asso- 
ciated iterative bias corrected smoother fhk by observing that 

fhk ^ fha + bi ^ h fofe (7) 

- Sx[I + {I - Sx) + {I - Sx)^ + ■ ■ ■ + {I - Sx)''-^]Y (8) 

= SxA- (9) 

This implies that fhk{Xj) = SxiXjY (3k- Hence we propose to extend the itera- 
tive bias corrected smoother to R'' via the function 

mk{x) = Sx{x)%. (10) 
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2.2. Stopping Rules 

Selecting a suitable number of bias correction iterations k is crucial. There 
exists an unknown number k of bias corrections iterations of thin-plate spline 
base smoothers that produces estimators for the regression function that achieve 
the optimal rate of convergence for mean square error (see [Biihlmann and Yu, 
2003] for the univariate case and [Cornillon ct al., 2011] for the multivariate 
counter-part). This optimal unknown number of iterations can be estimated 
consistently from data using GCV [Cornillon et al., 2011]. 

But one can select the number of iterations from data using classical model 
selection methodologies such as: GCV [Craven and Wahba, 1979], AIC [Akaikc, 
1973], BIC [Schwarz, 1978], AICc [Hurvich et al., 1998] or gMDL [Hansen and Yu, 
2001]. In particular, use of AICc is advocated in Biihlmann and Hothorn [2007] 
and empirical evidence for GCV can be found in Cornillon et al. [2008]. Other 
methods such as cross-validation (leave-one-out or .ftT-fold) or the use of training 
set and test set are also reasonable procedure to estimate k [Biihlmann and Hothorn, 
2007]. Both empirical [Cornillon et al., 2011] and theoretical [Cornillon ct al., 
2011] considerations support using GCV in practice. 

2.3. Choice of kernel for kernel smoothers 

The behavior of the sequence of iterative bias corrected kernel smoothers depend 
critically on the properties of the smoother kernel. Specifically, the smoothing 
kernel needs to be positive definite [see Di Marzio and Taylor, 2008, Cornillon et al., 
2011]. Examples of positive definite kernels include the Gaussian and the trian- 
gle densities, and examples of kernel that are not definite positive include the 
uniform and the Epanechnikov kernels. 

3. Implementation in R 

Our implementation of the iterative bias corrected procedure in R follows the 
established S3 methods [see writing R extensions R Development Core Team, 
2009]. The main function (called ibr) produces an object of class ibr. Applying 
generic functions, such as summary, predict, plot or residuals, to an ibr class 
object produces the expected standard summary statistics, prediction for new 
data (or fitted values), plot of the object and residuals. 

3.1. Base smoother 

Two types of base smoother are implemented in the function ibr: thin-platc and 
kernel smoother. This choice is driven by the smoother argument (character): 
tps or k. For kernel smoother, some classical choice are available using the 
kernel argument (character): Gaussian kernel (g, the default), triangle density 
(t), and the quartic (q) density. The computations have been optimized for the 
Gaussian kernel. Argument kernel enables the use of Epanechnikov kernel (e) 
or uniform (u) kernel but only for pedagogical purposes. 

imsart-generlc ver. 2009/08/13 file: ibr3.tex date: January 21, 2013 



/Iterative Bias Reduction 



7 



3.2. Computations 

To predict new data, we compute recursively /3k using equations (8) and (9). 
Computation of the fitted values using Equation (4) can be computed using a 
similar recursive update formula: starting with bo = {I — Sx)Y: 

rhk =Y - {I - Sx)bk-i and bk-i = {I - S\)bk.^2- 

Computations of cither b^ or require 0{kin?) operations and is implemented 
numerically by using the corresponding level 2 Bias function [Golub and Van Loan, 
1996]. In practice, we often found that the number of iterations k that are re- 
quired to be evaluated in order to select an good data-driven choice k is com- 
mensurate with the sample size n. Thus the algorithm that produces the final 
smoother is typically of order O(n^). 

Numerical experiments have shown that an alternative algorithm, based on 
an eigenvalue decomposition of the smoothing matrix S\ (also an order O(n^) 
algorithm), is faster when combined with GCV for selecting the number of 
iterations. We have implemented the latter algorithm in the ibr package. This 
approach is easily understood and implemented for thin plate spline smoothers, 
whose smoothing matrix S\ is symmetric. For kernel smoothers, the smoothing 
matrix is not symmetric and further discussion is needed. 

While the kernel base smoother S\ is not symmetric, we can rewrite equation 
(4) using an cigen decomposition of a symmetric matrix. Specifically, write S\ = 
DK, where K is symmetric matrix with general element = f^^^j^ K {{Xi^ — Xjk)/hk} 
and D a diagonal matrix with entries Da = 1/X]?=i'^ij- With this notation 
we write the smoothing matrix of ifik in Equation (4) as 

I-{I-Sx)'' = I-{I-DK)'' 

where A = D^^^KD^^^ . The latter is symmetric, and so can diagonalizcd A = 
UAU', with U the orthogonal matrix of eigenvectors and A the diagonal matrix 
of eigenvalues. Equation (4) becomes 

rhk ^ D^''^U{I ~{I - Kf)U'D-^''^Y. 

The coefRcient in (8) becomes 

h = D^''^U[I + {I-K) + {I-Kf + --- + {I-Kf-^]U'D-^/'^Y. 

Recognizing the sum inside the bracket as the fc — 1 first term of geometrical 
series, we rewrite 

pk = D^/^UK-\l -{I- Kf)U'D^/^. 

The core of computation becomes the cigcn decomposition which is done in a 
very efhcient way by the function eigen for moderate n (n < 1000 for instance). 
For additional efficiencies, the computations of A and D^^^ are done in C for 
the default Gaussian kernel. 
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3.3. Stopping rules 

The ibr package implements several classical criteria to empirically select an 
optimal number k of bias correction iterations. They include Generalized Cross 
Validation (GCV), the Akaike Information Criteria (AIC), the Bayesian Infor- 
mation Criteria (BIC), a corrected Akaike Information Criteria (AICc) and gen- 
eralized Minimum Description Length (gMDL). The choice for which method 
is to be used is controlled by the argument criterion of the ibr function, 
with the default method being GCV. Cross-validation are also available, but 
our discussion of that method is postponed to Section 3.5. 

The evaluation of an optimal number k of iterations using any one of these 
classical criteria is not a trivial task. The package ibr implements both a compu- 
tationally burdensome exhaustive search method and a computationally efficient 
but approximate method. The latter is the default method. The user can request 
ibr to perform an exhaustive search by setting the argument exhaust ive=TRUE 
in the list control. par. 

3.3.1. Exhaustive search method 

The exhaustive search method evaluates, for each k in an interval [/^min! ^^^max], 
the criterion to identify its global minimizer. The default values for the range 
are /Tmin = 1 and i^max = 10^. 

3.3.2. Numerical optimization method 

The default method relies on the fact that the criteria is easily calculated for 
arbitrary k G M^. This enables us to use standard optimization routine to 
minimize the criterion. While this approach is conceptually simple, there are 
two pitfalls: First, most criterion break down for very large k for which the 
smoother essentially interpolates the data, i.e., rhk ~ Y . Second, some criterion 
exhibit multiple local minima (see figure 3b). 

All model selection criteria trade-off goodness of fit, as measured by logdjF — 
WfelP) with a measure of the complexity of the smoother. Numerical difficulties 
arise when Y « m^, which occurs when the number k of iterations is close to 
the sample size n. To overcome this problem, we bound from above the maxi- 
mum allowable number of iterations by setting the variable dfmaxi in the list 
control .par. By default, its value is 27T./3. Hard-coded error handling prevents 
evaluation of the criteria when either k > n(l — 10~^°) or ||y — m^^lp < 10~^°. 
These exceptions also apply to the exhaustive search algorithms. 

Classical model selection criteria have been developed in the context where 
the effective number of estimated parameters is significantly smaller than the 
number of observations. Investigation of the criteria, as a function of effective 
degrees of freedom over a broader range of values reveals the presence of multi- 
ple local minima. While this does not impact the performance of the exhaustive 
search, the presence of local minima is potentially problematic for standard 
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minimization algorithms. Our solution is to divide the interval [ii'min; -?^max] 
into smaller subintervals and apply on each subinterval a numerical optimiza- 
tion using the function optimize, and the minimizer of these minimizations 
is returned. The splitting is controlled by the argument fraction in the list 
control. par, with default value of c(100, 200, 500, 1000, 5000, le04, 
5e04, le05, 5e05, le06). 




iterations iterations 
(a) Typical evolution of GCV with k. (b) Evolution of GCV with k with two 

minima figured by arrows. 

Fig 3: Evolutions of GCV with the number of iterations k 

While the strategy of optimizing the criteria in subintervals is more expensive 
than optimizing over the original interval, it remains significantly faster than 
performing an exhaustive search. 

3.4- Scales of variables 

The function ibr is designed to be used with two types of linear smoothers: 
thin-plate splines and kernel smoothers. Thin-plate splines are governed by a 
single parameter A that weights the contribution of the roughness penalty. As a 
result, it is desirable to scale all the variables to have equal variance to ensure 
that the roughness penalty is applied equally to each variable. This is achieved 
by pre-processing the data with the scale function before applying smoothing 
the data with ibr. 

Our implementation of the kernel smoother enables the use of a vector of 
different bandwidths, one for each of the regression variables. While the discus- 
sion on scaling applies when a common bandwidth is used for all the variables, 
we found in our numerical experiments that we get better results when we use 
the original variable but select a suitable bandwidth for each variable. The ob- 
jective is not to select an optimal bandwidth, but rather control the amount of 
smoothing we do at each iteration. To this end, we propose to select the band- 
widths such that one-dimensional smoothing matrix for each variable has the 
same effective degree of freedom. Typical values for the effective degree of free- 
dom values are 1.05, 1.1, 1.2, 1.5 or 2, which the user sets with the df argument. 
Given an desired effective degree of freedom, package automatically determines 
the bandwidth using an adaptation of the uniroot algorithm programed in C. 
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Relating the effective degree of freedom of each of the univariate components 
to an effective degree of freedom for the multivariate smoother is non-trivial. As 
a result, some users may prefer to control the overall smoothing instead of the 
marginal smoothing of each of component. To enable (or disable) the control 
of the overall smoothing, the flag df total in list control. par have to be set 
to TRUE (the default value of that flag is df total=FALSE). With this option, 
our package takes the value of the argument df to calculate the individual 
bandwidths of each component using a C routine. 

3.5. Stopping rules: K-fold cross-validation and Data splitting 

Simple cross-validation, K-fold cross-validation, and more generally data split- 
ting, are well established techniques for model selection that we use to determine 
the optimal number of iteration k for our iterative bias correction scheme. For 
these methods, the data arc separated into two sets, a training set to estimate 
the regression function and a testing set to evaluate the out of sample prediction 
error, using either the root mean square error criterion="rmse" or the mean 
absolute error criterion="map" loss functions. We numerically minimize that 
prediction error, either using an optimization routine, the default method, or 
by exhaustive search (set exhaust ive=TRUE in the list control .par). 

Since simple leave-one out cross-validation usually leads to estimator that 
under-smooths (in our case, the selected number of iterations k is larger than 
the optimal one), we prefer to use either data splitting or i^-fold cross-validation. 
The main difference between these two procedures is that usually data splitting 
is conducted once (except if the user asks for more using argument npermut) 
whereas for A'-fold cross-validation, the original sample is randomly partitioned 
into K subsamples with each of the K subsets used as the test set and the 
remainders K — 1 subsets are combined to form the training set. The prediction 
error is then computed by averaging the errors across the K trials. In summary, 
we split the original data into two samples : a training one on which we evaluate 
the estimator and a testing one on which we predict the new observations as 
shown in figure 4. 

The list cv. options in ibr controls the various options for cross-validation, 
including the size of the training set, the number of repetition of the procedure, 
the loss function and the type of splitting. 

3.5.1. Selecting the number of iterations k with Data splitting 

To have ibr perform a data splitting cross-validation, we set the following op- 
tions in cv. options: 

1. Input either ntest or ntrain, the size of the testing set Uv or the size 
of the training test (n — nu), respectively. The default value set ntest to 
Ln/lOj. 

2. Set the number of times the dataset is split in npermut. For classical data 
splitting, npermut have to be set equal to one (the default value is 20). 
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X 



Data set 



Splitting 




Computation of 5^ 
yt=l,2,3,.... 



k=l,2,3,. 



RMSE(k)=\ln:)\Y^,n^,^\\l 



-1, 



MAP(k)= «, ||(y,^^p/yj|^ 

Fig 4: Training set and validation set 



3. Set the type equal to raiidom to enable random data splitting. This stage 
can be omitted as this is the default value. The argument seed can be 
used to control the seed of the random generator. 

Data splitting (with test set of size L?i/10j) with root mean square error loss is 
achieved by the code 

> ibr(X, Y, criterion="rmse" , cv. options=list (npermut=l) ) 

A more complex example of data splitting that uses 100 samples of 3 observa- 
tions to evaluate the prediction error using the mean absolute deviation loss is 
achieved with the code 

> ibr(X, Y, criterion="map" , cv. options=list (ntest=3,npermut=100) ) 

3.5.2. Selecting the number of iterations k with K-fold cross-validation 

To perform a ii'-cross- validation with ibr, we set the following options in 
cv . options: 

1. Set Kf old=TRUE (default is FALSE) or set Kf old equal to the number 
of folds 
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2. Set the number of folds K. One can either specify the size of the testing 
set Hy in ntest or the size of the training set {n — riy) in ntrain, in which 
case the fold is computed to he K = [n/n„J. One can set the number 
of folds K using by setting the argument Kf old equal to K. This implies 
that the size of the testing set is [n/K\. 

3. Specify the type of data-split. By default, the data are split randomly 
(type="rELndom" ). Alternatively, we divide the data using consecutive stretch 
of data (type=" consecutive") or interleaved split (type=" interleaved"). 
A forth option, type="timeseries" divides the data chronologically and 
uses the last [n/K \ for the testing set. The splitting implied by consecutive 
is shown in figure 5a while the splitting using interleaved is shown in 
figure 5b. The obvious case of random draw is not shown. Finally, the op- 
tional argument seed can be used to control the seed for random number 
generator. It is given as seed argument of the set . seed function. 




Fold 1 
Fold 2 
Folds 
Fold 4 
Folds 

Fold 6 k \ K \ ■■: ■-. \ \ 



-. \ | \ \ ■■■■ \ \ \ ■■■■ \ \ \ ■■■■ \ \ \ \ 



1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 



(a) Graphical summary of consecutive (b) Graphical summary of interleaved K - 
K = 6 folds cross-validation. 



folds cross-validation. 



Fig 5: Options consecutive and interleaved for K-fold cross-validation 



The first two Unes of code give rise to the examples summarized in 5a and 
5b, while the third line corresponds to a random K-fold cross-validation: 

> ibr(X,Y, criterion="irmse" , cv. options=list (Kfold=6 , type=" consecutive ")) 

> ibr(X,Y, criterion="irmse" , cv. options=list (Kfold=6 , type= " interleaved" ) ) 

> ibr(X, Y, criterion="rmse" , cv. options=list (Kfold=6 , type= " random" ) ) 

Finally, if the user wants to perform an exhaustive search for the number of 
iterations (from 1 to 1000 iterations) using the leave-one out cross-validation, 
she runs 

> ibr CX, Y, criterion="rmse" ,Kmax=1000 , control .par=list (exhaustive=TRUE) , 
+ cv . options=list (Kfold=TRUE ,ntest=l , type=" consecutive" ) ) 
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3.6. Variables selection 

We can apply the standard strategy of balancing prediction errors and model 
complexity to select predictors. The main issue with variable selection with ibr 
is computational, as we wish to compare models using an optimal number of 
bias reduction iterations. To limit fitting models with many parameters, we only 
consider forward variable selection (sec algorithm 1). 

In analogy to selecting the number of iterations, controlled by entries in the 
list criterion, we control the variable selection procedure with the list varcrit. 
The latter has the same default values as the former. 



Algorithm 1 Forward function 

Require: criterion (GCV, AIC, AICc, BIG, gMDL, MAP or RMSE) 
Require: varcrit (GCV, AIC, AICc, BIG, gMDL) 
s <~ I # current stage 

R matrix of infinity with d columns # Matrix of results 

iS # variable(s) selected at current stage 

*min ^ oo # Current minimum of criterion 

for s = 1 to d do 

for _;' = 1 to d such that J 5 do 

Sc <~ S U {j} # Adding one variable to the set of variables already selected 

res <- ibrCX^^ ,Y, criterion) # Xg^ is the dataset with explanatory variables in Sc 

evaluation of criterion varcrit for res: Rgj 
end for 

if all {Rsj}j > Smin then 

Return matrix R. from row 1 to s — 1 
else 

# Updating 

5 <— iS U {argmiuj Rsj} 

Smin miuj Rsj # Current minimum of criterion varcrit 
s <— s + 1 
end if 
end for 



The forward function returns an object of class f orwardibr. A plot method 
is provided for this class of object. 

4. Examples 

Let us return to the Wcndelberger's test function (see equation (5)): 

> f <- function(x, y) { . 75*exp(- ( (9*x-2) '2 + (9*y-2) '2) /4) + 
+ .75*exp(-((9*x+l)'2/49 + (9*y+l) '2/10) ) + 
+ .50*exp(-((9*x-7)~2 + (9*y-3) ~2) /4) - 
+ .20*exp(-((9*x-4)~2 + (9*y-7)~2)) } 

We start by plotting this function on a 50 x 50 grid of points in the unit square 
(0, 1) X (0, 1) that produces Figure 1. 
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> ngrid <- 50; xf <- seq(0,l, length^ngrid+2) [-c (1 ,ngrid+2)] 

> yf <- xf ; zf <- outer (xf, yf, f) 

> grid <- cbind(rep(xf , ngrid), rep(xf, rep(ngrid, ngrid))) 

> perspCxf , yf, zf, theta=130 , phi=20, expand=0 . 45 ,main="True Function") 

Next, we can generate a dataset of 100 noisy observations of the function / 
evaluated on the regular grid {0.05, 0.15, . . . , 0.85, 0.95}^, with Gaussian distur- 
bances that have zero mean and standard deviation producing a signal to noise 
ratio of five. 

> noise <- .2 ; N <- 100 

> xr <- seq(0 . 05,0 . 95,by=0 . 1) ; yr <- xr ; zr <- outer (xr ,yr ,f) ; set . seed(25) 

> std <- sqrt(noise*var (as. vector (zr))) ; noise <- rnorm (length (zr) ,0 ,std) 

> Z <- zr + matrix (noise, sqrt (N) ,sqrt(N)) 

Concatenate the explanatory variables into a 100 x 2 matrix that results in the 
objects X and Zc. 

> xc <- rep(xr, sqrt(N)) ; yc <- rep(yr, rep(sqrt(N) ,sqrt(N))) 

> X <- cbind(xc , yc) ; Zc <- as .vector (Z) 

In this example, we will use thin-plate splines of order vq. Since the procedure 
is adaptive, the default value is the smallest possible smoothness, which is 2 in 
our case. The effective degree of freedom of the thin plate smoother needs to 
be slightly larger than Mq = C'Jj^^Y^)- '-'^^ example, M = 3 and we chose A 
such that the effective degree of freedom was 1.1 x M = 3.3. Figure 2 (a) graphs 
the base smoother at iteration zero. 

> res.ibr <- ibr(X,Zc ,df=l . 1 , control .par=list(iter=l) ,smoother="tps") 

> fit <- matrix(predict (res . ibr , grid) , ngrid, ngrid) 

> persp(xf, yf, fit ,theta=130,phi=20,expand=0.45,main="Fit" ,zlab="fit") 

Figure 2 (b) and (c) show the bias corrected smoother after 500 and 50,000 
iterations. To compute the smoother whose number of iterations is selected 
with GCV, we use 

> res.ibr <- ibr(X,Zc,df=l.l,smoother="tps") 

> summary (res . i br) 

The summary output of the resulting smoother prints the residuals standard 
error, the degree initial freedom and reveals that the final degree of freedom is 
26.5 and the value of (log) GCV is -3.63 after iterations fcccy = 424 iterations. 

Residuals : 

Min IQ Median 3Q Max 

-0.235036 -0.068252 -0.007412 0.069061 0.301478 
Residual standard error: 0.1197 on 73.5 degrees of freedom 

Initial df : 3.3 ; Final df : 26.5 

gcv 
-3.63 
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Number of iterations: 424 chosen by gcv 

Base smoother: Thin plate spline of order 2 (with 3.3 df) 

To compute the fitted values, we use the predict function 

> predict Cres . ibr) 

that can be used to compute the Mean Absolute Error (MAE) on a grid 

> mean Cabs ("predict (res . ibr , grid) -as . vector (zf))) 
[1] 0.0578394 

To plot the fitted value, we employ the code 

> predgrid <- matrix(predict(res. ibr , grid) ,ngrid,ngrid) 

> persp (xf , yf , predgri d , theta=130 , phi =20 , expand=0 .45 ,zlab= "f i t") 



Fig 6: Fitted regression function rhk{xx,X2) on the unit square [0, 1] x [0, 1], the 
number of iteration is chosen by GCV: kccv — 424. 

To use either the AICc or the BIC criterion to select the number of iterations, 
we write 

> res.ibr.aicc <- ibr (X,Zc, df =1 . 1 , sii!ootiier="tps crit="aicc".) 

> res.ibr.bic <- ibr(X,Zc,df=l . l,smootiier="tps", crit="bic".) 

Direct display of an ibr object gives the following short description 

> res.ibr.aicc 

Initial df: 3.3 ; Final df : 20.98 
Number of iterations : 247 chosen by aicc 

> mean (abs (predict (res . ibr . aicc, grid) -as. vector (zf) ) ) 
[11 0.0583159 

which reveals that AICc required 247 iterations and the resulting smoother has 
a slightly larger than mean absolute error then what we obtained using GCV. 
This last MAE is close to the thin plates spline smoother with A (not k) selected 
with GCV 

> res . tps <- Tps (X, Zc) 

> mean (abs (predict (res . tps, grid) -as . vector (zf))) 
[1] 0.05823783 
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4.1. Real example: Los Angeles Ozone Data 

We consider the classical Los Angeles basin ozone concentration data set used 
by numerous authors (see for example Breiman [1996], Biihlmann and Yu [2003, 
2006]) to demonstrate the performance of various high dimensional smoothing 
techniques. The data consists of n = 330 observed ozone concentration related 
to d = 8 explanatory variables. 

The order i/q of thin plate splines needs to be greater than d/2, that is I'o = 5. 
This implies that the minimal effective degree of freedom of the thin plate spline 
smoother S\ is Mq = 495, which is greater than the sample size n. Even for 
larger sample sizes, say n = 500, the thin plate splines will be unsatisfactory 
base smoother (recall that in the preceding section, for d = 2 we started at 3.3 
df with 100 observations). 

For this reason, let us consider the (default) Gaussian kernel smoother. As 
we discussed in Section 3.4, we do not scale the eight explanatory variables but 
instead select the bandwidth of each univariate smoother to achieve a smoothing 
matrix that has an effective degree of freedom of 1.1. This ensures that at face 
value, each of the eight covariates has the same influence. The number of possible 
bias correction iterations k considered by the model selection procedure for 
selecting the optimal number of iterations lies between one and 10, 000 (default 
values for Kmin and Kmax). The R code for fitting this data is 

> data (ozone) 

> res.ibr <- ibr(ozone[,-l] ,ozone[,l] ,df=l. 1) 

> summary (res . i br) 

Residuals : 

Min IQ Median 3Q Max 

-13.5581 -2.0566 -0.3481 1.9816 12.6049 
Residual standard error: 3.946 on 309.6 degrees of freedom 

Initial df; 2.06 ; Final df : 20.42 

gcv 
2.873 

Number of iterations : 64 chosen by gcv 

Base smoother: gaussism. kernel (with 2.06 df) 

From the summary, we see that the optimal number of iterations is kccv = 
64, which can be thought as quite low (recall that in the previous example 
the number of iterations ranged between 200 and 400). In this example, an 
exhaustive search method for determining the optimal number of iterations 

> ibr (ozone [,-1] , ozone [, 1] , df=l . 1 , control .par=list ( exhaust ive=TRUE) ) 

gives the same result. Because we only need a relatively small number of bias cor- 
rection steps, we can select a smaller initial effective degree of freedom, say 1.05, 
while maintaining the computational complexity at a manageable level. Indeed, 
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decreasing the effective degree of freedom of tlie pilot smootlier increases the 
total number of bias reduction steps while typically providing some performance 
gains as measured by out of sample prediction errors. 

A plot method is also available for the ibr object to display the residuals as 
a function of the index. 

> plot (res . ibr) 




5 10 15 20 25 30 

Fit 



Fig 7: Index plot of residuals. 



To emulate [see Biihlmann and Yu, 2003] and draw 50 random splits of the data 
into a set of 297 training data and a set of 33 testing data, we issue the following 
commands 

> XX <- ozone [,-1] 

> Y <- ozone [, 1] 

> erreurl.5 <- rep(0, 33*50) 

> aa <- c (1,945095059, 162152953) 

> ford in 1:50){ 

+ set. seed (aa+i) 

+ ind <- sample (1:330, 33) 

+ XXA <- XX [-ind,] 

+ YA <- Y[-ind] 

+ XXT <- XX [ind,] 

+ YT <- Y[ind] 

+ res. ibr <- ibr (XXA, YA) 

+ erreurl.5[(33*(i-l)+l) : (33*i)] <- YT-predict(res.ibr,XXT) 
+ } 

> print (mean (erreurl . 5 ~2) ) 

We get an error of 14.98, which compare favorably with GAM (mgcv: 17.44), 
MARS (mda: 17.49), projection pursuit (ppr: 17.79 for nterms=2) or boosting 
(package mboost: 17.23). Note that since no default is available for the nterms 
argument of the function ppr, we follow the examples provided in the ppr doc- 
umentation and have set nterms equal to 2. To summarize, the ibr smoother 
enjoys a 15% reduction in the out of sample prediction mean squared error over 
other state-of-the-art multivariate smoothing methods. 

We note that the above comparison favors the L2 boosting and MARS algo- 
rithms that take advantage of built-in variable selection procedures. To compare 
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to these methods, we apply the forward variable selection using the random 
splitting method to ibr for this data set issuing the following commands. 

> aa <- c (1,945095059, 162152953) 

> i <- 1 

> set . seed (aa+i) 

> ind <- sample (1:330, 33) 

> XXA <- XX [-ind,] 

> YA <- yf-indj 

> XXT <- XX [ind,] 

> YT <- Y[ind] 

Wc select variables using the commands 

> forward. ibr <- forward (XXA, YA) 

> varnumber <- apply (forward. ibr , 1 ,which.min) 

> varnumber 
[1] 4 3 7 6 5 

That is, the order of the variables to be included into the model is 4, 3, 7, 
6 and 5. Variable selection leads to improved predictions. To quantify, on the 
testing set, this improvement, we compare the the prediction MSE of the selected 
five variable model with the prediction MSE of model that uses all the eight 
variables. 

> res. ibr <- ibr (XXA, YA) 

> mean ( (YT-predict (res. ibr, XXT) )~2) 
[1] 22.90836 

> res.ibr2 <- ibr (XXA [, varnumber] ,YA) 

> mean ( (YT-predict (res . ibr2 , XXT [ , varnumber] ) ) ~2) 
[1] 21.12792 

This shows a small improvement. In conclusion for this example, we remark that 
despite the increased computational time, the forward function provides simple 
and useful tool for selecting variables. 



5. Conclusion 

The ibr package provides additional features which are not offered by other 
packages on CRAN. These features are a complete implementation, using R 
language, of iterative biased reduction procedure which implement and gener- 
alize the twicing idea of Tukey [1977]. 

This method of smoothing for multivariate dataset seems to be promising 
especially on real dataset. But one limitation of this smoothing method is the 
use of matrix n x n, where n is the number of observations. Moreover, at the 
present time, the computational bottleneck is the eigen decomposition of an 
n X n matrix, which limits the size of the dataset to which this procedure can 
be applied too. 
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